Bubble statistics and positioning in superhelically stressed DNA 
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We present a general framework to study the thermodynamic denaturation of double-stranded 
DNA under superhelical stress. We report calculations of position- and size-dependent opening 
probabilities for bubbles along the sequence. Our results are obtained from transfer-matrix solutions 
of the Zimm-Bragg model for unconstrained DNA and of a self-consistent linearization of the Benham 
model for superhelical DNA. The numerical efficiency of our method allows for the analysis of entire 
genomes and of random sequences of corresponding length (10^10^ base pairs). We show that, at 
physiological conditions, opening in superhelical DNA is strongly cooperative with average bubble 
sizes of 10^10^ base pairs (bp), and orders of magnitude higher than in unconstrained DNA. In 
heterogeneous sequences, the average degree of base-pair opening is self- averaging, while bubble 
localization and statistics are dominated by sequence disorder. Compared to random sequences with 
identical GC-content, genomic DNA has a significantly increased probability to open large bubbles 
under superhelical stress. These bubbles are frequently located directly upstream of transcription 
start sites. 

PACS numbers: 87.14.gk, 87.15.A-, 36.20.Ey, 
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INTRODUCTION 



Fundamental processes, such as transcription and 
replication require a transient, local opening of the DNA 
double helix [i|,[2|- Such bubbles occur spontaneously un- 
der physiological conditions [sl, Q , while complete melting 
and the separation of the two complementary strands re- 
quires temperatures around 80^C Bubbles have been 
implicated as an explanation for high cyclization rates in 
short DNA fragments [H, 0] , but their main interest lies 
in biology and in the physical mechanism underlying the 
functioning and control of transcription and replication 
start sites: the stability of DNA is sequence-dependent [11 
and opening is strongly influenced by superhelical stress 
lol-llH and the binding of regulatory proteins [H, [l3| . In 
particular, Benham and coworkers showed significant cor- 
relations between the positions of strongly stress-induced 
destabilized regions and regulatory sites [l3-[l6|. 
Several models exist to describe the internal opening 
of DNA. The Peyrard-Bishop-Dauxois [13, IHj and the 
Poland-Scheraga (PS) models [HI, [2Q| have already been 
used to quantify bubble statistics |2ll-[28| and for the 
ab initio annotation of genomes on the basis of corre- 
lations between biological function and thermal melting 
[29I-I3H. Here, we describe the bubble statistics of ran- 
dom and biological sequences at physiological conditions 
using (i) the Zimm-Bragg (ZB) model, an efficient ap- 
proximation of the PS model [311], and (ii) the Benham 
model [lo|, a generalization of the ZB model account- 
ing for superhelical density but neglecting writhe. After 
exploring the bubble statistics of unconstrained DNA us- 
ing the ZB model, we introduce an asymptotically exact 
self-consistent linearization [32| of the Benham model as 
a precise and convenient tool to study the huge impact 
of superhelicity on local bubble opening. The numerical 



efficiency of the method allows us to investigate the bub- 
ble statistics for entire genomes and random sequences of 
sufficient length (10^ — 10^ bp) to obtain statistically sig- 
nificant results for sequence effects on bubble statistics 
and positioning. In a final step, we correlate the posi- 
tions of highly probable bubbles within the genome of E. 
coli with the position of transcription start sites (TSS) 
and start codons. 



II. MODELS AND METHODS 



Unconstrained DNA 



The Zimm-Bragg model 



The most widely-used model to treat the denatura- 
tion of DNA chains is the PS model (l9| which offers 
predictive power for thermal melting and strand dissoci- 
ation for DNA of arbitrary length, strand concentration 
and a wide range of ionic conditions. For long heteroge- 
neous sequences, whose local denaturation is dominated 
by the quenched sequence disorder (26l-[28|, the related 
and computationally faster ZB model gives surprisingly 
good results [3l|. In the PS and ZB formalisms, the free 
energy of a given configuration is decomposed into for- 
mation free energies for closed base-pair steps and free 
energy penalties for the nucleation of unpaired regions 
(or bubbles). Using periodic boundary conditions, the 
ZB Hamiltonian for a circular chain of length TV can be 
expressed as [3l| 
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nzB = {{Agi - AgioopWiOi^i + AgioopOi} (1) 
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where Oi = (1) if base-pair i is open (closed). Agi 
(~ ki)T) is the nearest-neighbor (NN) free energy to form 
the base-pair step (i,z + l). Agioop = —kBTlog[apsL~^] 
is the loop nucleation penalty and depends on the PS co- 
operativity aps ~ 1Q~^, on the interacting self-avoiding 
loop exponent c ^ 2.1 [H, [HI and on a typical bubble 
length L {Agioop ^ ^OhT for L = 170). Parame- 
ters are taken at physiological conditions for temperature 
T = 37^ C and salt concentration [Na^] = 0.1 M (slj. 



2. Transfer-matrix method 

Being formulated as a ID Ising model, the model is 
easily solved analytically (numerically) for homogeneous 
(heterogeneous) sequences using transfer matrix meth- 
ods. The partition function of the system is then given 
by M 



Z — Trace 



N 



the individual closing probability by 
1 



(2) 



Z 



-Trace 



N 



n^.p/' n 

j=i / Vi=i+i 



(3) 



In particular, it is possible to calculate position resolved 
opening probabilities, P^,/, for bubbles containing / open 
bps and beginning at the closed bp i, by 
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The global closing probability B and the probability per 
bp Pi to observe a bubble of length / are given by B = 
{Ef=i{Oi))/N and Pi = {Ef^lP^,l)/N. Basic summing 
rules on the bubble probabilities imply that 1 — B = 
IPi, and that C = (E, /PO/(E Pi) = (1 - 
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where jC is the average bubble length and P5 = Pi is 
the probability per bp to observe a bubble of arbitrary 
length. 



For homogeneous sequences, these equations could be 
solved easily. In the asymptotic limit A/" ^ 00, it leads 
to 
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with g = exp[— /3A^] and s = ex.p[—l3 Agioop]- The bubble 
probability is therefore a decreasing exponential function 
of / for homogeneous sequences. 

For heterogeneous sequences, we numerically compute 
the different desired observables in a 0(A/')-algorithm. In 
a first step, we iteratively compute the forward and back- 
ward products, Fi = (iVj^iPj) and Bi = (n^^i^i)- 
The second step consists in applying equations [21 [3] and 

m 



B. Superhelical DNA 

In living organisms, DNA is highly topologically con- 
strained into circular domains (closed-loops or circular 
molecules) [l| . Each closed domain is defined by a topo- 
logical invariant L, the so-called linking number. L rep- 
resents the algebraic number of turns either strand of the 
DNA makes around the other. It can be decomposed in 
two contributions: the twist which is the number of turns 
of the double- helix around its central axis, and the writhe 
which is the number of coils of the double-helix. In the 
majority of living systems, the average linking number L 
is below the characteristic linking number value Lo of the 
corresponding unconstrained linear DNA, due to a neg- 
ative superhelical density a = aA/N ^ —0.06 imposed 
by protein machineries, where a = L — Lo is the linking 
number difference. 



1. The Benham model 

In this article, we consider bubble openings in su- 
perhelically constrained circular DNA using the Ben- 
ham model for an imposed superhelical density a, where 
the standard thermodynamic description of base-pairing 
{^zb) is coupled with torsional stress energetics [ol, lioj . 
For each state, if one neglects the writhe contribution, 
the imposed linking difference a can be decomposed in 
three contributions: 1) the denaturation of base-pairs 
relaxes the helicity by —Uo/A where A = 10.4 bp/turn 
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FIG. 1: Effective super helical hamiltonian 1-Ltw per bp as a 
function of the relative opening rio/N in the asymptotic limit, 
for different stress values: cr = (squares), -0.03 (stars), -0.06 
(circles) and -0.1 (crosses). 



is the number of base-pair in a helical turn; 2) the re- 
sulting single-strand regions can twist around each other 
inducing a global over-twist of T; 3) then, the bending 
and twisting of double-stranded parts is put in the resid- 
ual linking number a^. Therefore, due to the topological 
invar iance of a, we get the closure relation 
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For denatured regions, the high flexibility of single- 
stranded DNA allows unpaired strands to interwind. The 
energy associated with the helical twist (in rad/bp) of 
open base-pair i is 



(10) 



where the torsional stiffness C is known from experi- 
ments, C ^ 3.09/cbT. The individual twist are re- 
lated to the global over-twist T via the relation T = 
^^(1 — Oi)ri/ {27r). For paired helical regions, it has been 
experimentally found that superhelical deformations in- 
duce an elastic energy, quadratic in the residual linking 
difference 
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where Uq = — Oi) is the number of open base-pairs 

and K = K' /N ^ 2220kBT/N. By integrating over 
the Uo continuous degrees of freedom , the superhelical 
stress energetics is represented by a non-linear effective 
Hamiltonian lOl: 
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Htw is minimal for a non-zero number of opening base- 
pairs which increases as the stress strength increases (see 
figure [1]) . 

The total effective Hamiltonian is given by ^^ef f{{^i}) = 
nzB{{Oi}) + nrwirio)- We fix Agioop = 20hT. 



2. Torque-imposed ensemble 

The Benham Hamiltonian is defined in a superhelical 
density (<j)-imposed ensemble defined by equation [9l In 
this section, we briefly discuss similar model but in the 
torque-imposed ensemble. 

In this ensemble, a linear DNA segment (length TV) is 
constraint by a weak torque F applied on base the 
first bp being fixed. For each bp i, we define uji its orien- 
tation in the plane perpendicular to the average axis of 
the double- helix and oriented in the 5' to 3' direction (for 
a denatured bp-step, SuJi = cj^+i — uJi = r^/(27r) of the 
Benham model). The total Hamiltonian of the system is 
then given by [33|] 
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- 0^{SUJ^ - Uof - T{UJN - UJl) (13) 



with Kr = i^ry (47r^) and uq = 2i\ j A the natural helical 
twist. Writing {ujn —^\) = ^^i^ and integrating over 
the SuJi^ leads to the effective Hamiltonian (relatively to 
the situation without torque): 
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Applying the constant torque F is therefore equivalent to 
adding an external field = -[T'^{1/{2C) - l/{2Kr)) - 
Tujq] in the ZB formalism. 



C. Self-consistent linearization 

1. Solving the Benham model 

Recently, Jeon et al [39| derived formulas to solve the 
Benham model for homogeneous and random heteroge- 
neous sequences, including the computation of sequence- 
average bubble properties. The computation is based on 
a reorganization of the partition function sum into partial 
sums for fixed numbers of bubbles. 

For heterogeneous sequences, Fye and Benham [lo| 
proposed an exact 0(A^^)-algorithm by decomposing the 
prefactor exp[— /3HTiy(^o)] in discrete Fourier modes, 
and by using the transfer-matrix method. Benham and 
coworkers [sj, [lH, [13 have also developed an approxi- 
mate method which first involves a windowing procedure 
to find the minimum free energy and then consider only 
the states whose energies do not exceed the minimum one 
by more than a given threshold. At high threshold val- 
ues or high negative superhelicity, the computation time 
for this algorithm scales exponentially with the threshold 
and the superhelical stress. Both schemes are still time 
demanding for very long sequences. 
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2. Self- consistent field 

In order to speed up the resolution of the Benham 
model and to access directly to position-dependent open- 
ing properties of bps and bubbles, we develop an efficient 
variational method [32| allowing us to use the transfer- 
matrix solution of the ZB model. For long sequences, 
assuming that fluctuations of Uo are small around the 
value no, we can expand Hrwi^o) around n^. The ap- 
proximated effective Hamiltonian then takes the typical 
ZB form: 

N 

Ueff ^ nzBm}) + nrwifio) + (iV - no)h -hY,0^ 

(15) 

where h = {dl-Lrw / dno){no) represents the mean- field of 
our approximation. If one imposes the superhelical stress 
r (ie, the torque) instead of the superhelical density, h 
is related to the effective field {Att'^N/K - l/C){T'^/2) + 
27tT/A generated by the torque (see above). 
In the following, we employ the Benham ensemble of an 
imposed superhelical density, a, and determine h (or F) 
self-consistently. Self-consistency requires {no){no) = ^o, 
and is equivalent to solving 

h = {dnTw/dno){{no){h)), (16) 

because the function {dl-Lrw / dno){no) is monotonic. 



ically. However, at low temperatures, weak superhelical 
densities and in the limit of infinitely long chains, a small 
perturbation development is valid and leads to analyti- 
cal expressions for h. For infinitely long chains, 1-Ltw 
becomes 

with X rio/N, K' = NK and / = 

-log[27r/(/3C)]/(2/3). Assuming that the fraction of 
open base-pairs x is small {x « 1), dl-Lrw /duo = 
d{l-LTW /N) / dx is given by 

(18) 

Using EqI7]and noting y = exp[f3h], we also have 

sy 



(19) 



In the limit gy >> 1, inserting x « s/{g^y) in Eg ITSl and 
solving EqUni leads to 
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This expression is valid until gy ^ 1 (ft. « A^), i.e. 



(20) 



3. Homogeneous sequences 

For homogeneous sequences, the general solution of the 
self-consistency equation [T6l cannot be computed analyt- 



(21) 

For a < ai, we could write gy = l-\-e (with e > 0). Then, 
X ^ {s/g)/e'^ and h ^ + e. Solution of Eq(T6] leads to 
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(22) 



Figure [2] shows that the numerical solution of EqlTHl 
agrees very well with the two expressions found above 
(Eq|20]and[22]). 

To compute bp or bubbles properties, formulas O [7] and 
[8] are still available if one replaces g hy gy and s by sy. 



4- Heterogeneous sequences 

For heterogeneous sequences, we use the bisection 
method coupled to the Newton-Raphson method |36|] to 
numerically solve the self-consistency equation, for fixed 
values of the temperature and of the superhelical density. 
Knowing that an evaluation of the function / requires 



one transfer matrix method computation {0{N) each), 
it takes typically 10 — 20 evaluations to determine the 
root with a relative precision of 10~^. This allows nu- 
merically efficient computation of denaturation profiles. 
For example, computing the local closing probabilities 
for the E.coli genome {N ~ 4.6 Mbps) takes about 70 
seconds on a 2.4 GHz computer with the self-consistent 
method, whatever the density is. On the same computer, 
it would take about 10^^ s with the exact method {0{N'^) 
algorithm) for any a values (interpolation of data given 
in Ref.[io[), and about 6.10^ s with the approximate 
method for a = —0.055 ( and around 40 times more for 
a = -0.075) [3i,[35|,i37,]. 
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FIG. 2: Numerical (dots) or analytical (lines) solutions of the 
self-consistency equation Eq[T6] for a homogeneous sequence 
(A^ = -3.14/cbT). 
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FIG. 3: (Color online) Opening free energy — log[l — (Oi)] for 
the extended neighborhood of the TSS of Fig. IIOI computed 
using the Benham's web server [s^] (A) or using our formalism 
(B). 



5. Comparison with the Benham model 

The self-consistent linearization consists in working in 
the torque-imposed ensemble and determining the torque 
self-consistently to better approximate the superhelical 
density-imposed ensemble. This representation has the 
advantage of decoupling the opening of different parts of 
the molecule and is probably the simplest way to express 
the destabilizing effect of undertwisting on DNA stabil- 
ity. In the thermodynamic limit, i.e., for very long (ge- 
nomic) sequences and at low temperature (well below the 
melting temperature), we expect our linearization to be 
a quasi-exact solution of the Benham model due to the 
asymptotic decrease of fluctuations within the system. 
For shorter sequences, however, the non-linearity of the 
effective superhelical Hamiltonian Htw (see Eq fT2|) may 
significantly couple remote domains along the sequence, 
leading for example, to the closing of an open domain 
as one increases the superhelical stress (as observed in 
Fig|3]A). Although, the self-consistent linearization ne- 
glects such effects, it gives a reliable general picture of the 
superhelically-stressed destabilization of DNA sequences. 
In figure [3l we show a comparison of results obtained 




Temperature (K) 

FIG. 4: Evolution of the global opening probability 1 — G 
as a function of temperature for an unconstrained (stars) or 
constrained {a = 0: squares, a = —0.06: circles, a = —0.2: 
crosses) random sequence (GC=0.5). 



from the Benham web server [s^, [sll and by our method. 
The agreement is excellent and the small deviations are 
mostly due to the slightly different parametrizations of 
the ZB parts and to the absence of finite size effect in our 
approach as discussed above. 



III. RESULTS AND DISCUSSION 

In the following, we discuss results obtained for random 
homogeneous and heterogeneous sequences, as well as 
for some bacterial genomes {E.coli^ T. whipplei, A. Bau- 
manii, B. subtilis and S. coelicolor, see Table [Tj). The re- 
sults shown for random heterogeneous sequences were ob- 
tained by compiling profiles from 100 random sequences 
each containing 10^ bp. 



A. Melting of superhelical DNA 

The melting properties of constrained DNAs reflect a 
balance between the two parts of He//. At 37^ C, Hzb 
opposes local opening the more strongly the higher the 
GC-content of the sequence. In contrast, Htw is min- 
imal for a flnite number of open base-pairs, which in- 
creases as a becomes more negative. At biological lev- 
els, the free energy of completely closed DNA becomes 
prohibitively large (see Fig(T]). As a consequence, su- 
perhelicity leads to a notable level of base-pair opening 
at physiological temperatures, where unconstrained DNA 
exhibits negligible breathing. 

Figure m shows the overall impact on the opening prob- 
ability of imposing a superhelical constraint. For com- 
parison, we have also included a melting curve for un- 
constrained DNA. At physiological temperatures, the un- 
constrained DNA is very stable whereas the superhelic- 
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FIG. 5: (A) Melting temperature for random sequences (GCg 
[0, 1]) at different salt concentrations ([A^a+] G [0.05, 1] M) as 
a function of the superhelical density a (dots), respectively to 
the situation with a = 0. (B) Melting temperature at a = 
for random sequences as a function of their GC-content, at 
[A/'a^] = 0.05 (circles), 0.1 (squares) and 0.3 M (crosses), re- 
spectively to the corresponding unconstrained (ZB) situation. 
Dashed lines represent the fitted relation given in Eq|23l 



ity significantly contributes to opening of base pairs and 
bubbles. However, for intermediate and weak negative 
stresses, the destabilizing effect of an imposed superheli- 
cal density is reversed close to the melting temperature 
due to the overall stabilizing impact of untwisting on the 
rest of the DNA {Urwirio = N/2) > 0), resulting in a 
slowdown of the melting process via a change of T (or h) 
with temperature. This effect has already been pointed 
out for positively-stressed homogeneous molecules (iof . 
For stronger stresses, the effective twisting Hamiltonian 
Htw at the melting transition {uq = N/2) is still neg- 
ative and then results in a decrease of the melting tem- 
perature. A quantification on this salt-, GC-dependent 
eflFect is described below. 

On figure [5] A, we observe that, under the different con- 
sidered GC-contents and salt concentrations, the melting 
temperature of random superhelical DNA is a quadratic 
function of a, and only the intercept of this function 
(Tm{cF = 0)) depends on the GC and on [7Va+] (see figure 
[5]B). From a systematic study of the melting tempera- 
ture Tm as a function of the GC-content and the salt 
concentration, we fitted the empirical relation (in ^ C): 

AT^([7Va+],GC,a) = 2.35 - 0.54 log[7Va+] - 3.42GC 
-0.93GC^ + 4.7a - 46.80"^ (23) 

where AT^ is the difference in melting temperature be- 
tween a constrained and an unconstrained random DNA 
polymers. 

Figure [6] shows the dependance of the opening proba- 
bility 1 — O as a function of the GC-content. Results are 
reported for different levels of superhelical density and at 
T = 37^ C, in the biological relevant range, i.e the overall 
degree of opening is small yet strongly increased relative 
to the unconstrained case (see also FigH]). We have in- 
cluded results for both homogeneous and heterogeneous 
systems. In the former case, the employed NN-free ener- 
gies Ag were determined as composition dependent av- 
erages over the tabulated step parameters [sf, [20| . 
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FIG. 6: (Color online) Evolution at T = 37^ C of the 
global opening probability 1 — G for superhelically constrained 
random heterogeneous (full lines) and homogeneous (dashed 
lines) DNAs under different a (from -0.02 to -0.1 every 0.01: 
color scale from light to dark red, a = —0.06: blue line) 
and for several bacterial genomes at a = —0.06: E. coli (tri- 
angles), T. whipplei (squares), A. Baumanii (diamonds), B. 
subtilis (circles) and 5*. coelicolor (stars). Inset: proportion of 
the different contributions to the linking number difference a: 
relaxation due to denaturation —rio/N (I), over-twist of the 
denatured regions T (II) and residual linking number ar (III) 
(see EqEl). 



Passing a threshold [s^ (see Inset in Figl6]) depending 
on the GC-content, strong a- values allow the opening 
of many bp along the sequences. For a fixed superheli- 
cal stress, as expected, 1 — B is a decreasing function of 
the GC-content. Differences between homogeneous and 
heterogeneous are weak, meaning that sequence hetero- 
geneity self-averages and has only a small effect on the 
total degree of opening. 

The inset in figure [6] shows the different contributions 
to the linking number difference as a function of the su- 
perhelical density (see Eql9]) for random sequences with 
GC=0.5. The over-twist T contribution is estimated by 
integrating over the continuous degrees of freedom 
[10] and applying the self-consistent linearization: 
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We observe that the over-twist increases linearly with 
the opening probability. While for weak stresses, almost 
all the superhelical energy is stored in the (residual) de- 
formations of the double-helix, for strong stresses, more 
than 50 % of the imposed superhelical constraint is used 
to drive the local denaturation of bps. Accounting for the 
writhe in the model should however decrease the contri- 
butions due to bp-denaturation and over-twisting since a 
part of the constraint would be absorbed by coils of the 
double-helix. 
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FIG. 7: (Color online) Evolution of the bubble probability 
per bp Pi to observe a bubble of length /. (A) For uncon- 
strained molecules: random sequences of different GC-content 
(GC from to 1 every 0.1: color scale from light to dark 
red, GC=0.5: squares) and the genome of E. coli (dashed 
line with crosses, GC=0.51). Inset: average bubble length 
£ as a function of the GC-content. (B) For super helically- 
stressed molecules: a random sequence {GC = 0.5, squares), 
the genome of E. coli (crosses) and a homogeneous sequence 
(with Ag = —1.9kBT equals to the average NN-free energies 
in E. coli, circles), and cr = (dotted and dashed lines), —0.06 
(full lines) and -0.1 (dashed lines). 



Sequence heterogeneity and bubble statistics in 
superhelical DNA 

1. Bubble statistics in unconstrained DNA 



= 5x10 

CO 




0.4 0.6 
GC-content 



Figure [JK shows the bubble probabilities computed 
with the ZB model for random DNAs of different GC- 
content, with Agioop = lOkbT consistent with very small 
loops {L ^ C ^ 1, see inset in Fig|7lA). We remark an 
exponential decrease of Pi with very short decay lengths, 
corresponding to fairly closed molecules. Increasing GC- 
content stabilizes the DNA and reduces average bubble 
lengths. Even accounting for the weak biological se- 
quence effect apparent in our results, the absolute level of 
bubble opening in unconstrained DNA seems too small 
for natural breathing to play a direct biological role [ll| . 

Whereas we obtain similar £- values as reported for 
the Peyrard-Bishop-Dauxois (PBD) model [23|, absolute 
probabilities Pi are far lower (for example Pi ~ 10~^^ 
(ZB) versus - IQ-^ (PBD) for / = 10 and GC=0.5), 
mainly due to the absence of a cooperativity penalty fac- 
tor preventing bubble formation in the PBD model. 



2. Bubble statistics in superhelical DNA 

Figure [7p highlights the huge impact of the superhe- 
lical stress on the bubble statistics. For physiological 
levels, the opening probability per bp remains significant 
even for large bubbles and has increased by many orders 



FIG. 8: (Color online) Evolution of bubble occurrence proba- 
bility Pb (A,B) and of the average bubble length C (C,D) for 
superhelically constrained random heterogeneous (B,D) and 
homogeneous (A,C) DNAs under different stresses a (from 
-0.02 to -0.1 every 0.01: color scale from light to dark red, 
a = —0.06: blue line) and for several bacterial genomes at 
a — —0.06: E. coli (triangles), T. whipplei (squares), A. 
Baumanii (diamonds), B. subtilis (circles) and S. coelicolor 
(stars) . 



of magnitude compared to the unconstrained situation. 
For example, for GC=0.5 and a = -0.06, Pi=io - 10"^ 
and P/=60 ^ 10~^ for superhelical DNA, while for the cor- 
responding unconstrained DNA, we found Pi^io ~ 10~^^ 
and Pi^eo ^ 10"^^ 

Figure [8] shows the evolution of the bubble occurence 
P5 and of the average bubble length C for different a- 
values and GC-content. With Hrwi^o) only being a 
function of the total number of open base-pairs, bubble 
sizes in homogeneous systems are determined by a com- 
petition between the bubble initiation penalty, Agioop, 
favoring the opening of a small number of large bubbles, 
and entropy, favoring the opening of a large number of 
small bubbles. In heterogeneous systems, it is possible 
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FIG. 9: (Color online) Evolution of the average GC-content of 
bubbles for super helically constrained random heterogeneous 
DNAs under different stresses a (from to -0.1 every 0.01: 
color scale from light to dark red, a = —0.06: blue line). The 
dashed line represents results for homogeneous sequences. 



to lower the fraction of stable GC-steps in the open do- 
mains by denaturing a larger number of smaller bubbles 
in particularly AT-rich regions (see Figl9j) [4l|. The com- 
parison in Figs. [8] and [9] shows that the disorder effect 
dominates in the present case (26l-[28| with the number of 
bubbles being maximal around GC=0.5. For biological 
super helicities, C ^ 700 for random AT- and GC-DNAs, 
while C ^ 150 for 0.2 < foe < 0.9. 

Sequence-heterogeneity plays an essential role by low- 
ering the fraction of stable GC-steps in the open domains 
and leads to a localization of open base-pairs in (AT- 
enriched) stress-induced duplex destabilized regions 
[l6|, whose length in turn limits the bubble sizes. Indeed, 
figure [9] shows that bubbles appear mainly in AT-enriched 
regions compared to the background GC-content. Inter- 
estingly, for intermediate (biological) a values, the evolu- 
tion of the GC-composition of bubbles remains flat over 
a large range of global GC-content. 



C. Bubble statistics in biological DNA 

Compared to random sequences with identical GC- 
content, bacterial genomes {E.coli^ T. whipplei^ A. Bau- 
manii, B. subtilis and S. coelicolor) exhibit higher av- 
erages degrees of opening and increased average bubble 
sizes (symbols in Figs l6] and [8]). Figure [7p shows strik- 
ing discrepancies in the bubble distribution Pi between 
random and biological sequences, with a significant en- 
richment of large bubbles. In the latter case, the nearly 
flat distributions resemble those expected for homoge- 
neous sequences and can be viewed as a signature of large 
stress-induced destabilized domains, which concentrate 
the DNA breathing into large regions with homogeneous 
opening profiles. We also note that the level of opening 
in the biological domains is comparatively insensitive to 
the precise level of the superhelical density. 



TABLE I: Z-scores computed relatively to the average bubble 
length for 5 prokaryotic organisms. 



Organism 


N (Mbp) GC-content Z 


-score 


A. baumanii 


3.98 


0.40 


23.0 


B. subtilis 


4.21 


0.44 


21.0 


E. coli 


4.64 


0.50 


25.7 


S. coelicolor 


8.67 


0.70 


17.6 


T. whipplei 


0.93 


0.46 


10.6 



In general, biological sequences are more destabilized 
than random sequences with a same GC-content, leading 
to less but longer bubbles. We estimate these differences 
by computing, for all genomes, the Z-score relatively to 
C (see Tab HI . The Z-score is an estimation of the non- 
randomness of a specific sample. For the average bubble 
length £, it can be computed by 

^ ^ Cjgenome) - (^(random)) ^^^^ 

(random) 

where (>C(random)) and cr/:(random) are the mean and 
the standard deviation of C computed on 100 random 
sequences with the same GC and length as the corre- 
sponding genome. The Z-score represents therefore the 
distance between a specific point and the mean value ob- 
tained for random sequences, given in standard deviation 
units. Z-scores of 20 for C suggests the presence of over- 
represented long AT-rich domains in bacterial genomes 
which have the ability to easily open under superheli- 
cal stress. Interestingly, the organism with the lowest 
Z-score (T. whipplei) was shown to exhibit a random- 
like behavior relatively to the local melting temperature 
distribution (3ll |. 



D. Bubble positioning in the promoter regions of 
biological DNA 

Positions of strongly stress-induced destabilized re- 
gions have been shown to correlate with many regula- 
tory regions [l6| including transcription start sites [T^ 
or origins of replication [15|. In this section, we focus 
on bubble positioning inside the promoter regions of the 
bacterium E. coli. 

The transcription of DNA is initiated by the local open- 
ing of the double-helix at transcription start sites. Fig- 
ure [3] illustrates their association with strongly stress- 
induced destabilized regions. In addition to position- 
dependent opening probabilities, our approach allows us 
to calculate the complete bubble free-energy landscape, 
Gi,i = — ksT log Pi^i. Figure [To] shows the effect of super- 
helicity on Gi^i for the neighborhood of the same TSS in 
E. coli. The analysis reveals that opening is the result of 
the meta-stable unwinding of a large bubble and not of 
enhanced small scale breathing. We note that knowledge 
of Gi^i is essential for modeling the dynamics of bubble 
nucleation and growth [42[. 
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FIG. 10: (Color online) Free energies to open a bub- 
ble of a given length centered around a given nucleotide 
(— /csTlog[P^,z]) for the neighborhood of a transcription start 
site (bp n^ 259382) in the E. coli genome, in the presence 
(B, C, D) or in the absence (A) of an imposed superhelical 
density a = -0.03 (B), -0.06 (C) and -0.10 (D). 



Figure [TT] analyses the statistical relation beween TSSs 
and bubbles induced by superhelical stress for the en- 
tire genome of E. coli. Figure [TTl V shows a significant 
and non-random destabilization around TSSs, with a 
maximal opening around —80. The same computation 
using the ZB model shows insignificant and orders-of- 
magnitude smaller opening probabilities. However, we 
find a non-random signal around TSS-10 corresponding 
to the position of the AT-rich Pribnow box, an essen- 
tial motif to start transcription in bacteria [i^. Fig- 
ure [TTB gives the relative positions of highly probable 
bubbles included in TSS neighborhoods. The centers of 
these bubbles are mainly localized in the [TSS-200,TSS] 
region where many transcriptional and promoter factors 
are recruited and bind to DNA [isj. Conversely, figure 
WVC confirms that the majority of highly probable bub- 
bles are situated upstream and close to start codons of 
genes. Actually, these bubbles are composed by around 
36% of coding bps, significantly lower than the percent- 
age of coding bps in E. coli (88%). 




position from TSS (bp) position from start codon (bp) 

FIG. 11: (A) Opening probability as a function of the posi- 
tion relatively to the TSS, averaged on 760 promoter regions 
of E. coli [43[ (black hne), or on 760 randomly picked re- 
gions inside the genome (blue line), for a constrained (top) 
or unconstrained (down) DNA. (B) Histogram of the highly 
probable {Pi,i > 10~^) bubble centers included in the regions 
[TSS-300,Ts's+300] for the 760 studied TSS (black bars) or 
for randomly picked regions (blue line). (C) Probability distri- 
bution function of the distance between highly probable bub- 
ble centers and the nearest start codons for the 1158 actually 
found (black line) or randomly situated (blue line) bubbles. 



IV. SUMMARY AND CONCLUSION 



We have developed a numerically efficient, self- 
consistent solution of the Benham model of bubble open- 
ing in superhelically constrained DNA. In particular, we 
are able to go beyond the calculation of opening prob- 
abilities for base pairs and to address the full, position- 
dependent bubble statistics for entire genomes. Our re- 
sults indicate, that negative supercoiling leads to (meta-) 
stable unwinding of bubbles comprising 0(100 — 1000) 
base-pairs. In heterogeneous sequences, bubbles are 
strongly localized in AT-rich domains with sequence dis- 
order dominating the bubble statistics. As we have 
shown, large bubbles open with a significantly larger 
probability in biological sequences, than in random se- 
quences with identical GC-content. In the case of E. 
coli^ the most likely bubbles are located directly upstream 
from transcription start sites, highlighting the biological 
importance of this now well understood, physical prop- 
erty of DNA. 
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